Abstract


This paper shows a relationship between two different approximation techniques: the 
Support Vector Machines (SVM), proposed by V. Vapnik (1995), and a sparse
approximation scheme that resembles the Basis Pursuit De-Noising algorithm (Chen,
1995; Chen, Donoho and Saunders, 1995). SVM is a technique which can be derived 
from the Structural Risk Minimization Principle (Vapnik, 1982) and can be used 
to estimate the parameters of several different approximation schemes, including
Radial Basis Functions, algebraic/trigonometric polynomials, B-splines, and some forms
of Multilayer Perceptrons. Basis Pursuit De-Noising is a sparse approximation
technique, in which a function is reconstructed by using a small number of basis functions
chosen from a large set (the dictionary). We show that, if the data are noiseless, the 
modified version of Basis Pursuit De-Noising proposed in this paper is equivalent to 
SVM in the following sense: if applied to the same data set the two techniques give 
the same solution, which is obtained by solving the same quadratic programming 
problem. In the appendix we also present a derivation of the SVM technique in the 
framework of regularization theory, rather than statistical learning theory,
establishing a connection between SVM, sparse approximation and regularization theory.
1 Introduction


In recent years there has been an increasing interest in approximation techniques that use the 
concept of sparsity to perform some form of model selection. By sparsity we mean, in very 
general terms, a constraint that enforces the number of building blocks of the model to be small. 
Sparse approximation often appears in conjunction with the use of overcomplete or redundant 
representations, in which a signal is approximated as a linear superposition of basis functions 
taken from a large dictionary (Chen, 1995; Chen, Donoho and Saunders, 1995; Olshausen and 
Field, 1996; Daubechies, 1992; Mallat and Zhang, 1993; Coifman and Wickerhauser, 1992). In 
this case sparsity is used as a criterion to choose between different approximating functions 
with the same reconstruction error, favoring the one with the least number of coefficients. The 
concept of sparsity has also been used in linear regression, as an alternative to subset selection, 
in order to produce linear models that use a small number of variables and therefore have greater 
interpretability (Tibshirani, 1994; Breiman, 1993). 

In this paper we discuss the relationship between an approximation technique based on the
principle of sparsity and the Support Vector Machines (SVM) technique recently proposed by Vapnik
(Vapnik, 1995; Vapnik, Golowich and Smola, 1996). SVM is a classification/approximation
technique derived by V. Vapnik in the framework of Structural Risk Minimization, which aims at
building “parsimonious” models, in the sense of VC-dimension. Sparse approximation techniques
are also “parsimonious”, in the sense that they try to minimize the number of parameters of the
model, so it is not surprising that some connections between SVM and sparse approximation 
exist. What is more surprising and less obvious is that SVM and a specific model of sparse 
approximation, which is a modified version of the Basis Pursuit De-Noising algorithm (Chen, 
1995; Chen, Donoho and Saunders, 1995), are actually equivalent, in the case of noiseless data. 
By equivalent we mean the following: if applied to the same data set they give the same solution, 
which is obtained by solving the same quadratic programming problem. While the equivalence 
between sparse approximation and SVM for noiseless data is the main point of the paper, we 
also include a derivation of the SVM which is different from the one given by V. Vapnik, and 
that fits very well in the framework of regularization theory, the same one which is used to derive 
techniques like splines or Radial Basis Functions. 

The plan of the paper is as follows: in section 2 we introduce the technique of SVM in the 
framework of regularization theory (the mathematical details can be found in appendix B). 
Section 3 introduces the notion of sparsity and presents an exact and approximate formulation 
of the problem. In section 4 we present a sparse approximation model, which is similar in spirit 
to the Basis Pursuit De-Noising technique of Chen, Donoho and Saunders (1995), and show 
that, in the case of noiseless data, it is equivalent to SVM. Section 5 concludes the paper and 
contains a series of remarks and observations. Appendix A contains some background material on 
Reproducing Kernel Hilbert Spaces, which are heavily used in this paper. Appendix B contains 
an explicit derivation of the SVM technique in the framework of regularization theory, and 
appendix C addresses the case in which data are noisy. 
2 From Regularization Theory to Support Vector Machines

In this section we briefly sketch the ideas behind the Support Vector Machines (SVM) for
regression, and refer the reader to (Vapnik, 1995) and (Vapnik, Golowich and Smola, 1996) for a full
description of the technique. The reader should be warned that the way the theory is presented 
here is slightly different from the way it is derived in Vapnik’s work. In this paper we will take a 
viewpoint which is closer to classical regularization theory (Tikhonov and Arsenin, 1977;
Morozov, 1984; Bertero, 1986; Wahba, 1975, 1979, 1990), which might be more familiar to the reader,
rather than the theory of uniform convergence in probability developed by Vapnik (Vapnik, 1982;
Vapnik, 1995). A similar approach is described in (Smola and Schölkopf, 1998), although with
a different formalism. In this section and in the following ones we will need some basic notions 
about Reproducing Kernel Hilbert Spaces (RKHS). For simplicity of exposition we put all the 
technical material about RKHS in appendix (A). Since the RKHS theory is very well developed 
we do not include many important mathematical technicalities (like the convergence of certain 
series, or the issue of semi-RKHS), because the goal here is just to provide the reader with a 
basic understanding of an already existing technique. The rigorous mathematical apparatus that 
we use can mostly be found in chapter 1 of the book by G. Wahba (1990).

2.1 Support Vector Machines 

The problem we want to solve is the following: we are given a data set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, obtained
by sampling, with noise, some unknown function $f(\mathbf{x})$ and we are asked to recover the function
$f$, or an approximation of it, from the data $D$. We assume that the function $f$ underlying the
data can be represented as:

$$f(\mathbf{x}) = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x}) + b \qquad (1)$$

where $\{\phi_n(\mathbf{x})\}_{n=1}^{\infty}$ is a set of given, linearly independent basis functions, and $c_n$ and $b$ are
parameters to be estimated from the data. Notice that if one of the basis functions $\phi_n$ is constant
then the term $b$ is not necessary. The problem of recovering the coefficients $c_n$ and $b$ from the
data set D is clearly ill-posed, since it has an infinite number of solutions. In order to make 
this problem well-posed we follow the approach of regularization theory (Tikhonov and Arsenin, 
1977; Morozov, 1984; Bertero, 1986; Wahba, 1975, 1990) and impose an additional smoothness 
constraint on the solution of the approximation problem. Therefore we choose as a solution the 
function that solves the following variational problem: 

$$\min_{f \in \mathcal{H}} H[f] = C \sum_{i=1}^{l} V(y_i - f(\mathbf{x}_i)) + \frac{1}{2} \Phi[f] \qquad (2)$$

where $V(x)$ is some error cost function that is used to measure the interpolation error (for
example $V(x) = x^2$), $C$ is a positive number, $\Phi[f]$ is a smoothness functional and $\mathcal{H}$ is the set of
functions over which the smoothness functional $\Phi[f]$ is well defined. The first term enforces
closeness to the data and the second enforces smoothness, while $C$ controls the tradeoff between these
two terms. A large class of smoothness functionals, defined over elements of the form (1), can
be defined as follows:
$$\Phi[f] = \sum_{n=1}^{\infty} \frac{c_n^2}{\lambda_n} \qquad (3)$$

where $\{\lambda_n\}_{n=1}^{\infty}$ is a decreasing, positive sequence.

That eq. (3) actually defines a smoothness functional can be seen in the following:

Example: Let us consider a one-dimensional case in which $x \in [0, 2\pi]$, and let us choose
$\phi_n(x) = e^{inx}$, so that the $c_n$ are the Fourier coefficients of the function $f$. Since the sequence
$\{\lambda_n\}_{n=1}^{\infty}$ is decreasing, the constraint that $\Phi[f] < \infty$ is a constraint on the rate of convergence to
zero of the Fourier coefficients $c_n$, which is well known to control the differentiability properties
of $f$. Functions for which $\Phi[f]$ is small have limited high frequency content, and therefore do
not oscillate much, so that $\Phi[f]$ is a measure of smoothness. More examples can be found in
appendix A.
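
The example above can be illustrated numerically with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the original text: the function name `smoothness`, the truncation order `N`, the choice $\lambda_n = 1/n^2$, and the two coefficient sequences. It only shows that coefficients which decay quickly give a small, bounded value of the functional in eq. (3), while slowly decaying coefficients give a value that keeps growing with the truncation order.

```python
def smoothness(coeffs, lambdas):
    """Truncated version of eq. (3): sum over n of c_n^2 / lambda_n."""
    return sum(c * c / lam for c, lam in zip(coeffs, lambdas))

N = 50  # truncation order (assumption; eq. (3) is an infinite sum)

# Assumed decreasing, positive sequence lambda_n = 1 / n^2.
lambdas = [1.0 / n**2 for n in range(1, N + 1)]

# Two hypothetical coefficient sequences with fast and slow decay.
fast_decay = [1.0 / n**3 for n in range(1, N + 1)]
slow_decay = [1.0 / n**1.5 for n in range(1, N + 1)]

phi_fast = smoothness(fast_decay, lambdas)  # partial sum of 1/n^4, stays bounded
phi_slow = smoothness(slow_decay, lambdas)  # partial sum of 1/n, grows with N

print(phi_fast, phi_slow)
```

With this choice of $\lambda_n$ the slowly decaying sequence corresponds to a harmonic sum, so its smoothness value diverges as `N` increases, which is the sense in which $\Phi[f] < \infty$ constrains the decay of the coefficients.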

When the smoothness functional has the form (3) it is easy to prove (appendix B) that,
independently of the form of the error function $V$, the solution of the variational problem (2)
always has the form:

$$f(\mathbf{x}) = \sum_{i=1}^{l} a_i K(\mathbf{x}; \mathbf{x}_i) + b \qquad (4)$$

where we have defined the (symmetric) kernel function $K$ as:

$$K(\mathbf{x}; \mathbf{y}) = \sum_{n=1}^{\infty} \lambda_n \phi_n(\mathbf{x}) \phi_n(\mathbf{y}) \qquad (5)$$

The kernel K can be seen as the kernel of a Reproducing Kernel Hilbert Space (RKHS), a concept 
that will be used in section (4). Details about RKHS and examples of kernels can be found in 
appendix A and in (Girosi, 1997). 

If the cost function $V$ is quadratic the unknown coefficients in (4) can be found by solving a
linear system. When the kernel $K$ is a radially symmetric function eq. (4) describes a Radial
Basis Functions approximation scheme, which is closely related to smoothing splines, and when
$K$ is of the form $K(\mathbf{x} - \mathbf{y})$ eq. (4) is a Regularization Network (Girosi, Jones and Poggio, 1995).
When the cost function $V$ is no longer quadratic the solution of the variational problem (2)
still has the form (4) (Smola and Schölkopf, 1998; Girosi, Poggio and Caprile, 1991), but the
coefficients $a_i$ can no longer be found by solving a linear system. V. Vapnik (1995) proposed
to use a particularly interesting form for the function $V$, which he calls the $\epsilon$-insensitive cost
function, which we plot in figure (1):

$$V(x) = |x|_\epsilon \equiv \begin{cases} 0 & \text{if } |x| < \epsilon \\ |x| - \epsilon & \text{otherwise.} \end{cases} \qquad (6)$$
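
As a small illustration of eq. (6), the cost can be written in one line. The function name `eps_insensitive` and the sample residuals are assumptions made for this sketch and are not part of the original text.

```python
def eps_insensitive(x, eps):
    """Cost of eq. (6): zero inside the tube |x| < eps, linear outside it."""
    return max(0.0, abs(x) - eps)

# Hypothetical residuals y_i - f(x_i) and tube width.
residuals = [0.05, -0.3, 0.1, 0.7]
eps = 0.1
costs = [eps_insensitive(r, eps) for r in residuals]
print(costs)
```

Residuals whose magnitude does not exceed `eps` contribute nothing to the data term, which is what distinguishes this cost from the absolute value.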


The $\epsilon$-insensitive cost function is similar to some of the functions used in robust statistics (Huber,
1981), which are known to provide robustness against outliers. However the function (6) is not
only a robust cost function, but also assigns zero cost to errors which are smaller than $\epsilon$. In other
words, according to the cost function $|x|_\epsilon$, any function that comes closer than $\epsilon$ to the data points
is a perfect interpolant. In a sense, the parameter $\epsilon$ therefore represents the resolution at which
we want to look at the data. When the $\epsilon$-insensitive cost function is used in conjunction with
the variational approach of (2), one obtains the approximation scheme known as SVM, which
has the form

$$f(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\alpha}^*) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) K(\mathbf{x}; \mathbf{x}_i) + b, \qquad (7)$$

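where $\alpha_i^*$ and $\alpha_i$ are some positive coefficients which solve the following Quadratic Programming
(QP) problem:

$$\min_{\boldsymbol{\alpha}, \boldsymbol{\alpha}^*} R(\boldsymbol{\alpha}^*, \boldsymbol{\alpha}) = \epsilon \sum_{i=1}^{l} (\alpha_i^* + \alpha_i) - \sum_{i=1}^{l} y_i (\alpha_i^* - \alpha_i) + \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(\mathbf{x}_i; \mathbf{x}_j) \qquad (8)$$

subject to the constraints

$$\begin{aligned} 0 \le \alpha_i^*, \alpha_i &\le C \\ \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) &= 0 \\ \alpha_i \alpha_i^* &= 0 \quad \forall i = 1, \ldots, l \end{aligned} \qquad (9)$$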

Notice that the parameter $b$ does not appear in the QP problem, and we show in appendix (B)
that it is determined from the knowledge of $\boldsymbol{\alpha}$ and $\boldsymbol{\alpha}^*$. It is important to notice that it is possible
to prove that the last of the constraints above ($\alpha_i \alpha_i^* = 0$) is automatically satisfied by the solution
and it could be dropped from the formulation. We include this constraint just because it will be
useful in section 4.

Due to the nature of this quadratic programming problem, only a number of coefficients $\alpha_i^* - \alpha_i$
will be different from zero, and the input data points $\mathbf{x}_i$ associated to them are called support
vectors. The number of support vectors depends on both $C$ and $\epsilon$. The parameter $C$ weighs the
data term in functional (2) with respect to the smoothness term, and in regularization theory is
known to be related to the amount of the noise in the data. If there is no noise in the data the
optimal value for $C$ is infinity, which forces the data term to be zero. In this case SVM will find,
among all the functions which have interpolation errors smaller than $\epsilon$, the one that minimizes
the smoothness functional $\Phi[f]$. The parameters $C$ and $\epsilon$ are two free parameters of the theory,
and their choice is left to the user, as well as the choice of the kernel $K$, which determines the
smoothness properties of the solution and should reflect prior knowledge on the data. For certain
choices of $K$ some well known approximation schemes are recovered, as shown in table (1). We
refer the reader to the book of Vapnik (1995) for more details about SVM, and for the original
derivation of the technique.


Kernel Function                                                                  Approximation Scheme

$K(\mathbf{x}; \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2)$                                 Gaussian RBF
$K(\mathbf{x}; \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^d$                                  Polynomial of degree $d$
$K(\mathbf{x}; \mathbf{y}) = \tanh(\mathbf{x} \cdot \mathbf{y} - \theta)$ (only for some values of $\theta$)   Multi Layer Perceptron
$K(x; y) = B_{2n}(x - y)$                                                        B-splines
$K(x; y) = \frac{\sin((d + 1/2)(x - y))}{\sin((x - y)/2)}$                       Trigonometric polynomial of degree $d$

Table 1: Some possible kernel functions and the type of decision surface they define. The last
two kernels are one-dimensional: multidimensional kernels can be built by tensor products of
one-dimensional ones. The functions $B_n$ are piecewise polynomials of degree $n$, whose exact
definition can be found in (Schumaker, 1981).
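
The first three kernels of table (1) are straightforward to write down. The sketch below is only an illustration: the function names, the helper `gram_matrix`, and the sample points are assumptions and do not come from the original text. The Gram matrix $K_{ij} = K(\mathbf{x}_i; \mathbf{x}_j)$ it builds is the matrix that appears in the quadratic term of eq. (8).

```python
import math

def dot(x, y):
    """Euclidean scalar product of two equal-length sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))

def gaussian_kernel(x, y):
    """K(x; y) = exp(-||x - y||^2), the Gaussian RBF kernel."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def polynomial_kernel(x, y, d):
    """K(x; y) = (1 + x . y)^d, polynomial of degree d."""
    return (1.0 + dot(x, y)) ** d

def tanh_kernel(x, y, theta):
    """K(x; y) = tanh(x . y - theta); valid only for some values of theta."""
    return math.tanh(dot(x, y) - theta)

def gram_matrix(kernel, points):
    """Matrix K_ij = kernel(x_i, x_j) over a list of points."""
    return [[kernel(p, q) for q in points] for p in points]

# Hypothetical two-dimensional data points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
G = gram_matrix(gaussian_kernel, points)
print(G)
```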
3 Sparse Approximation

In recent years there has been a growing interest in approximating functions using linear
superpositions of basis functions selected from a large, redundant set of basis functions, called
a dictionary. It is not the purpose of this paper to discuss the motivations that lead to this
approach, and we refer the reader to (Chen, 1995; Chen, Donoho and Saunders, 1995; Olshausen and
Field, 1996; Harpur and Prager, 1996; Daubechies, 1992; Mallat and Zhang, 1993; Coifman and
Wickerhauser, 1992) for further details. A common aspect of these techniques is that one seeks
an approximating function of the form:

$$f(\mathbf{x}; \mathbf{a}) = \sum_{i=1}^{n} a_i \phi_i(\mathbf{x}) \qquad (10)$$

where $\phi \equiv \{\phi_i(\mathbf{x})\}_{i=1}^{n}$ is a fixed set of basis functions that we will call dictionary. If $n$ is very
large (possibly infinite) and $\phi$ is not an orthonormal basis (for example it could be a frame, or
just a redundant, finite set of basis functions) it is possible that many different sets of coefficients 
will achieve the same error on a given data set. A sparse approximation scheme looks, among all 
the approximating functions that achieve the same error, for the one with the smallest number 
of non-zero coefficients. The sparsity of an approximation scheme can also be invoked whenever 
the number of basis functions initially available is considered, for whatever reasons, too large 
(this situation arises often in Radial Basis Functions applied to a very large data set). 

More formally we say that an approximating function of the form (10) is sparse if the coefficients
have been chosen so that they minimize the following cost function:

$$E[\mathbf{a}, \boldsymbol{\xi}] = \left\| f(\mathbf{x}) - \sum_{i=1}^{n} \xi_i a_i \phi_i(\mathbf{x}) \right\|_{L_2}^2 + \epsilon \left( \sum_{i=1}^{n} \xi_i \right)^p \qquad (11)$$

where $\{\xi_i\}_{i=1}^{n}$ is a set of binary variables, with values in $\{0, 1\}$, $\| \cdot \|_{L_2}$ is the usual $L_2$ norm, and
$p$ is a positive number that we set to one unless otherwise stated. It is clear that, since the $L_0$
norm of a vector counts the number of elements of that vector which are different from zero, the
cost function above can be replaced by the cost function:

$$E[\mathbf{a}] = \left\| f(\mathbf{x}) - \sum_{i=1}^{n} a_i \phi_i(\mathbf{x}) \right\|_{L_2}^2 + \epsilon \|\mathbf{a}\|_{L_0} \qquad (12)$$

The problem of minimizing such a cost function, however, is extremely difficult because it
involves a combinatorial aspect, and it will be impossible to solve in practical cases. In order to
circumvent this problem, approximated versions of the cost function above have been proposed.
For example, in (Chen, 1995; Chen, Donoho and Saunders, 1995) the authors use the $L_1$ norm
as an approximation of the $L_0$ norm, obtaining an approximation scheme that they call Basis Pursuit
De-Noising. In related work, Olshausen and Field (1996) enforce sparsity by considering the
following cost function:

$$E[\mathbf{a}] = \left\| f(\mathbf{x}) - \sum_{i=1}^{n} a_i \phi_i(\mathbf{x}) \right\|_{L_2}^2 + \epsilon \sum_{j=1}^{n} S(a_j) \qquad (13)$$

where the function $S$ was chosen in such a way as to approximately penalize the number of non-zero
coefficients. Examples of some of the choices considered by Olshausen and Field (1996) are reported
in table (2).

$S(x)$

$|x|$
$-\exp(-x^2)$
$\log(1 + x^2)$

Table 2: Some choices for the penalty function $S$ in eq. (13) considered by Olshausen and Field
(1996).
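
A short sketch can make the role of the penalty functions in table (2) concrete. The dictionary `penalties`, the helper names, and the two test vectors are assumptions introduced for illustration only. The two vectors have the same Euclidean norm, but one concentrates it in a single coefficient; each of the three penalties assigns the smaller value to that sparser vector, in agreement with the count of non-zero coefficients.

```python
import math

# The three penalty functions S(x) listed in table (2).
penalties = {
    "abs": lambda x: abs(x),
    "neg_gauss": lambda x: -math.exp(-x * x),
    "log": lambda x: math.log(1.0 + x * x),
}

def l0_norm(a):
    """Number of non-zero coefficients."""
    return sum(1 for ai in a if ai != 0)

def penalty(a, S):
    """Penalty term of eq. (13) without the factor epsilon: sum of S(a_j)."""
    return sum(S(ai) for ai in a)

# Hypothetical coefficient vectors with equal Euclidean norm (both equal to 2).
sparse = [2.0, 0.0, 0.0, 0.0]
dense = [1.0, 1.0, 1.0, 1.0]

for name, S in penalties.items():
    print(name, penalty(sparse, S), penalty(dense, S))
```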

In the case in which $S(x) = |x|$, that is the Basis Pursuit De-Noising case, it is simple to see
how the cost function (13) is an approximated version of the one in (11). In order to see this,
let us allow the variables $\xi_i$ to assume values in $\{-1, 0, 1\}$, so that the cost function (11) can be
rewritten as

$$E[\mathbf{a}, \boldsymbol{\xi}] = \left\| f(\mathbf{x}) - \sum_{i=1}^{n} \xi_i a_i \phi_i(\mathbf{x}) \right\|_{L_2}^2 + \epsilon \sum_{i=1}^{n} |\xi_i| . \qquad (14)$$

If we now let the variables $\xi_i$ assume values over the whole real line, and assume that the coefficients
$a_i$ are bounded, it is clear that the coefficients $a_i$ are redundant, and can be dropped from the
cost function. Renaming the variables $\xi_i$ as $a_i$ we then have that the approximated cost function
is

$$E[\mathbf{a}] = \left\| f(\mathbf{x}) - \sum_{i=1}^{n} a_i \phi_i(\mathbf{x}) \right\|_{L_2}^2 + \epsilon \|\mathbf{a}\|_{L_1}, \qquad (15)$$

which is the one proposed in the Basis Pursuit De-Noising method of Chen, Donoho and Saunders
(1995).

4 An Equivalence Between Support Vector Machines 
and Sparse Coding 

The approximation scheme proposed by Chen, Donoho and Saunders (1995) has the form
described by eq. (10), where the coefficients are found by minimizing the cost function (15). We
now make the following choice for the basis functions $\phi_i$:

$$\phi_i(\mathbf{x}) = K(\mathbf{x}; \mathbf{x}_i) \qquad \forall i = 1, \ldots, l$$

where $K(\mathbf{x}; \mathbf{y})$ is the reproducing kernel of a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ (see
appendix A) and $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ is a data set which has been obtained by sampling, in absence
of noise, the target function $f$. We make the explicit assumption that the target function $f$
belongs to the RKHS $\mathcal{H}$. The reader unfamiliar with RKHS can think of $\mathcal{H}$ as a space of smooth
functions, for example functions which are square integrable and whose derivatives up to a certain
order are also square integrable. The norm $\|f\|_{\mathcal{H}}^2$ in this Hilbert space can be thought of as a linear
combination of the $L_2$ norm of the function and the $L_2$ norm of its derivatives (the specific degree
of smoothness and the linear combination depend on the specific kernel $K$). It follows from eq.
(10) that our approximating function is:

$$f^*(\mathbf{x}) \equiv f(\mathbf{x}; \mathbf{a}) = \sum_{i=1}^{l} a_i K(\mathbf{x}; \mathbf{x}_i) \qquad (16)$$

This model is similar to the one of SVM (eq. 7), except for the constant $b$, and if $K(\mathbf{x}; \mathbf{y}) =
G(\|\mathbf{x} - \mathbf{y}\|)$, where $G$ is a positive definite function, it corresponds to a classical Radial Basis
Functions approximation scheme (Micchelli, 1986; Moody and Darken, 1989; Powell, 1992).
While Chen et al., in their Basis Pursuit De-Noising method, measure the reconstruction error
with an $L_2$ criterion, we measure it by the true distance, in the $\mathcal{H}$ norm, between the target
function $f$ and the approximating function $f^*$. This measure of distance, which is common in
approximation theory, is better motivated than the $L_2$ norm because it not only enforces closeness
between the target and the model, but also between their derivatives, since $\| \cdot \|_{\mathcal{H}}$ is a measure
of smoothness. We therefore look for the set of coefficients $\mathbf{a}$ that minimize the following cost
function:

$$E[\mathbf{a}] = \frac{1}{2} \left\| f(\mathbf{x}) - \sum_{i=1}^{l} a_i K(\mathbf{x}; \mathbf{x}_i) \right\|_{\mathcal{H}}^2 + \epsilon \|\mathbf{a}\|_{L_1} \qquad (17)$$

where $\| \cdot \|_{\mathcal{H}}$ is the standard norm in $\mathcal{H}$. We consider this to be a modified version of the Basis
Pursuit De-Noising technique of Chen (1995) and Chen, Donoho and Saunders (1995).

At first sight, eq. (17) suggests that the cost function $E$ cannot be computed, because it
requires the knowledge of $f$ (in the first term). This would be true if we had $\| \cdot \|_{L_2}$ instead of
$\| \cdot \|_{\mathcal{H}}$ in eq. (17), and it would force us to consider the approximation:

$$\| f(\mathbf{x}) - f^*(\mathbf{x}) \|_{L_2}^2 \approx \frac{1}{l} \sum_{i=1}^{l} (y_i - f^*(\mathbf{x}_i))^2 \qquad (18)$$

However, because we used the norm $\| \cdot \|_{\mathcal{H}}$, we will see in the following that (surprisingly) no
approximation is required, and the expression (17) can be computed exactly, up to a constant
(which is obviously irrelevant for the minimization process).

For simplicity we assume that the target function $f$ has zero mean in $\mathcal{H}$, which means that its
projection on the constant function $g(\mathbf{x}) = 1$ is zero:

$$\langle f, 1 \rangle_{\mathcal{H}} = 0$$

Notice that we are not assuming that the function $g(\mathbf{x}) = 1$ belongs to $\mathcal{H}$, but simply that the
functions that we consider, including the reproducing kernel $K$, have a finite projection on it. In
particular we normalize $K$ in such a way that $\langle 1, K(\mathbf{x}; \mathbf{y}) \rangle_{\mathcal{H}} = 1$. We impose one additional
constraint on this problem:

• We want to guarantee that the approximating function $f^*$ also has zero mean in $\mathcal{H}$:

$$\langle f^*, 1 \rangle_{\mathcal{H}} = 0 \qquad (19)$$

Substituting eq. (16) in eq. (19), and using the fact that $K$ has mean equal to 1, we see that
this constraint implies that:

$$\sum_{i=1}^{l} a_i = 0 \qquad (20)$$
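We can now expand the cost function $E$ of equation (17) as

$$E[\mathbf{a}] = \frac{1}{2} \|f\|_{\mathcal{H}}^2 - \sum_{i=1}^{l} a_i \langle f(\mathbf{x}), K(\mathbf{x}; \mathbf{x}_i) \rangle_{\mathcal{H}} + \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j \langle K(\mathbf{x}; \mathbf{x}_i), K(\mathbf{x}; \mathbf{x}_j) \rangle_{\mathcal{H}} + \epsilon \sum_{i=1}^{l} |a_i|$$

Using the reproducing property of the kernel $K$ we have:

$$\langle f(\mathbf{x}), K(\mathbf{x}; \mathbf{x}_i) \rangle_{\mathcal{H}} = f(\mathbf{x}_i) \equiv y_i \qquad (21)$$

$$\langle K(\mathbf{x}; \mathbf{x}_i), K(\mathbf{x}; \mathbf{x}_j) \rangle_{\mathcal{H}} = K(\mathbf{x}_i; \mathbf{x}_j) \qquad (22)$$

Notice that in eq. (21) we explicitly used the assumption that the data are noiseless, so that we
know the value $y_i$ of the target function $f$ at the data points $\mathbf{x}_i$. We can now rewrite the cost
function as:

$$E[\mathbf{a}] = \frac{1}{2} \|f\|_{\mathcal{H}}^2 - \sum_{i=1}^{l} a_i y_i + \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j K(\mathbf{x}_i; \mathbf{x}_j) + \epsilon \sum_{i=1}^{l} |a_i| \qquad (23)$$

We now notice that the $L_1$ norm of $\mathbf{a}$ (the term with the absolute value in the previous equation)
can be rewritten more easily by decomposing the vector $\mathbf{a}$ into its “positive” and “negative” parts
as follows:

$$\mathbf{a} = \mathbf{a}^+ - \mathbf{a}^-, \qquad \mathbf{a}^+, \mathbf{a}^- \ge 0, \qquad a_i^+ a_i^- = 0 \quad \forall i = 1, \ldots, l.$$

Using this decomposition we have

$$\|\mathbf{a}\|_{L_1} = \sum_{i=1}^{l} (a_i^+ + a_i^-) . \qquad (24)$$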
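
The decomposition into positive and negative parts can be checked on a small example. The helper name `split_parts` and the sample vector are assumptions made for this sketch; the code only verifies the three properties used in the text: the difference of the parts recovers the vector, the parts are complementary, and their sum gives the $L_1$ norm.

```python
def split_parts(a):
    """Return the positive and negative parts (a_plus, a_minus) of a vector."""
    a_plus = [max(ai, 0.0) for ai in a]
    a_minus = [max(-ai, 0.0) for ai in a]
    return a_plus, a_minus

# Hypothetical coefficient vector.
a = [1.5, -2.0, 0.0, 0.25]
a_plus, a_minus = split_parts(a)

# L1 norm written as the sum of the two parts.
l1_from_parts = sum(p + m for p, m in zip(a_plus, a_minus))
l1_direct = sum(abs(ai) for ai in a)
print(a_plus, a_minus, l1_from_parts, l1_direct)
```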

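Disregarding the constant term in $\|f\|_{\mathcal{H}}^2$ and taking into account the constraint (20), we conclude
that the minimization problem we are trying to solve is equivalent to the following quadratic
programming (QP) minimization problem:

Problem 4.1 Solve:

$$\min_{\mathbf{a}^+, \mathbf{a}^-} \left[ - \sum_{i=1}^{l} (a_i^+ - a_i^-) y_i + \frac{1}{2} \sum_{i,j=1}^{l} (a_i^+ - a_i^-)(a_j^+ - a_j^-) K(\mathbf{x}_i; \mathbf{x}_j) + \epsilon \sum_{i=1}^{l} (a_i^+ + a_i^-) \right] \qquad (25)$$

subject to the constraints:

$$\begin{aligned} \mathbf{a}^+, \mathbf{a}^- &\ge 0 \\ \sum_{i=1}^{l} (a_i^+ - a_i^-) &= 0 \\ a_i^+ a_i^- &= 0 \quad \forall i = 1, \ldots, l \end{aligned} \qquad (26)$$

If we now rename the coefficients as follows:

$$a_i^+ \Rightarrow \alpha_i^*, \qquad a_i^- \Rightarrow \alpha_i$$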
we notice that the QP problem defined by equations (25) and (26) is the same QP problem that
we need to solve for training a SVM with kernel $K$ (see eq. 8 and 9) in the case in which the
data are noiseless. In fact, as we argued in section 2.1, the parameter $C$ of a SVM should be set
to infinity when the data are noiseless. Since the QP problem above is the same QP problem of
SVM, we can use the fact that the constraint $\alpha_i \alpha_i^* = 0$ is automatically satisfied by the SVM
solution (see appendix B) to infer that the constraint $a_i^+ a_i^- = 0$ is also automatically satisfied in
the problem above, so that it does not have to be included in the QP problem. Notice also that
the constant term $b$ which appears in (7) does not appear in our solution. We argue in appendix
B that for most commonly used kernels $K$ this term is not needed, because it is already implicitly
included in the model. We can now make the following:

Statement 4.1 When the data are noiseless, the modified version of Basis Pursuit De-Noising
of eq. (17), with the additional constraint (19), gives the same solution as SVM, and the solution
is obtained by solving the same QP problem as SVM.

As expected, the solution of the Basis Pursuit De-Noising is such that only a subset of the data
points in eq. (16) has non-zero coefficients, the so-called support vectors. The number of support
vectors, that is the degree of sparsity, is controlled by the parameter $\epsilon$, which is the only free
parameter of this theory.
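
The algebra behind this equivalence can be checked numerically on a toy problem. The sketch below does not solve the QP; it only evaluates the sparse cost of eq. (23), without its constant term, and the SVM objective of eq. (8) at the same feasible point and confirms that the two values agree. The function names, the one-dimensional Gaussian kernel, the sample data, and the coefficient vector are all assumptions introduced for illustration.

```python
import math

def gaussian_kernel(x, y):
    """One-dimensional Gaussian kernel (illustrative choice of K)."""
    return math.exp(-(x - y) ** 2)

def sparse_cost(a, ys, K, eps):
    """Eq. (23) without the constant term (1/2)||f||_H^2."""
    l = len(a)
    data = -sum(a[i] * ys[i] for i in range(l))
    quad = 0.5 * sum(a[i] * a[j] * K[i][j] for i in range(l) for j in range(l))
    return data + quad + eps * sum(abs(ai) for ai in a)

def svm_objective(alpha_star, alpha, ys, K, eps):
    """Objective R of eq. (8) as a function of alpha* and alpha."""
    l = len(alpha)
    d = [alpha_star[i] - alpha[i] for i in range(l)]
    return (eps * sum(alpha_star[i] + alpha[i] for i in range(l))
            - sum(ys[i] * d[i] for i in range(l))
            + 0.5 * sum(d[i] * d[j] * K[i][j] for i in range(l) for j in range(l)))

# Hypothetical noiseless samples and a coefficient vector summing to zero, as in (20).
xs = [0.0, 0.5, 1.0, 1.5]
ys = [0.0, 0.4, 0.8, 0.9]
a = [0.7, -1.2, 0.0, 0.5]
eps = 0.1

K = [[gaussian_kernel(p, q) for q in xs] for p in xs]

# Renaming of the text: a+ becomes alpha*, a- becomes alpha.
alpha_star = [max(ai, 0.0) for ai in a]
alpha = [max(-ai, 0.0) for ai in a]

lhs = sparse_cost(a, ys, K, eps)
rhs = svm_objective(alpha_star, alpha, ys, K, eps)
print(lhs, rhs)
```

Because the two objectives coincide on every feasible point, a QP solver applied to either formulation would return the same minimizer; any general-purpose QP routine could be used for that step.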

5 Conclusions and remarks 

In this paper we showed that, in the case of noiseless data, SVM can be derived without using any
result from VC theory, but simply enforcing a sparsity constraint in an approximation scheme of
the form

$$f(\mathbf{x}; \mathbf{a}) = \sum_{i=1}^{l} a_i K(\mathbf{x}; \mathbf{x}_i)$$

together with the constraint that, assuming that the target function has zero mean, the
approximating function should also have zero mean. This makes a connection between a technique such
as SVM, which is derived in the framework of Structural Risk Minimization, and Basis Pursuit
De-Noising, a technique which has been proposed starting from the principle of sparsity. Some
observations are in order:

• This result shows that SVM provide an interesting solution to a long-standing problem: the
choice of the centers for Radial Basis Functions. If the number of data points is very large
we do not want to place one basis function at every data point, but rather at a (small)
number of other locations, called “centers”. The choice of the centers is often done by
randomly choosing a subset of the data points. SVM provides a subset of the data points
(the support vectors) which is “optimal” in the sense of the trade-off between interpolation
error and number of basis functions (measured in the $L_1$ norm). SVM can therefore be seen
as a “sparse” Radial Basis Functions scheme in the case in which the kernel is radially symmetric.

• One can regard this result as an additional motivation to consider sparsity as an
“interesting” constraint. In fact, we have shown here that, under certain conditions, sparsity leads
to SVM, which is related to the Structural Risk Minimization principle, and is extremely
well motivated in the theory of uniform convergence in probability.

• The result holds because in both this and Vapnik’s formulation the cost function contains
both an “$L_2$-type” and an “$L_1$-type” norm. However, the Support Vector method has an
“$L_1$-type” norm in the error term, and an $L_2$ norm in the “regularization” term, while the
cost function (17) we consider has an “$L_2$-type” norm in the error term and an $L_1$ norm in
the “regularization” term.

• This result holds due to the existence of the reproducing property of the RKHS. If the
norm $\| \cdot \|_{\mathcal{H}}$ were replaced by the standard $L_2$ norm or any other Sobolev norm the cost
function would contain the scalar product in $L_2$ between the unknown function $f$ and the
kernel $K(\mathbf{x}; \mathbf{x}_i)$, and the cost function could not be computed. If we replace the RKHS
norm with the training error on a data set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ (as in Basis Pursuit De-Noising) the
cost function could be computed, but it would lead to a different QP problem. Notice that
the cost function contains the actual distance between the approximating and the unknown
function, which is exactly the quantity that we want to minimize.

• As a side effect, this paper provides a derivation of the SVM algorithm in the framework 
of regularization theory (see appendix B). The advantage of this formulation is that it is 
particularly simple to state, and it is easily related to other well known techniques, such 
as smoothing splines and Radial Basis Functions. The disadvantage is that it hides the 
connection between SVM and the theory of VC bounds, and does not make clear what 
induction principle is being used. When the output of the target function is restricted to 
be 1 or -1, that is we consider a classification problem, Vapnik shows that SVM minimize 
an upper bound on the generalization error, rather than minimizing the training error
within a fixed architecture. Although this is rigorously proved only in the classification 
case, this is a very important property, that makes SVM extremely well founded from the 
mathematical point of view. This motivation, however, is missing when the regularization 
theory approach is used to derive SVM. 

• The equivalence between SVM and sparsity has only been shown in the case of noiseless
data. In order to maintain the equivalence in the case of noisy data, one should prove that
the presence of noise in the problem (17) leads to the additional constraint $\boldsymbol{\alpha}^*, \boldsymbol{\alpha} \le C$ as
in SVM, where $C$ is some parameter inversely related to the amount of noise. In appendix
C we sketch a tentative solution to this problem. This solution, however, is not very
satisfactory because it is purely formal, and it does not explain what assumptions are made
on the noise in order to maintain the equivalence.

Acknowledgments I would like to thank T. Poggio and A. Verri for their useful comments and B. 
Olshausen for the long discussions on sparse approximation. 

A Reproducing Kernel Hilbert Spaces 

In this paper, a Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950) is defined as a Hilbert
space of functions defined over some domain $\Omega \subset R^d$ with the property that, for each $\mathbf{x} \in \Omega$, the
evaluation functionals $F_{\mathbf{x}}$ defined as

$$F_{\mathbf{x}}[f] = f(\mathbf{x}) \qquad \forall f \in \mathcal{H}$$

are linear, bounded functionals. It can be proved that to each RKHS $\mathcal{H}$ there corresponds a positive
definite function $K(\mathbf{x}, \mathbf{y})$, which is called the reproducing kernel of $\mathcal{H}$. The kernel of $\mathcal{H}$ has the
following reproducing property:

$$f(\mathbf{x}) = \langle f(\mathbf{y}), K(\mathbf{y}, \mathbf{x}) \rangle_{\mathcal{H}} \qquad \forall f \in \mathcal{H} \qquad (27)$$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the scalar product in $\mathcal{H}$. The function $K$ acts in a similar way to
the delta function in $L_2$, although $L_2$ is not a RKHS (its elements are not necessarily defined
pointwise). Here we sketch a way to construct a RKHS, which is relevant to our paper. The
mathematical details (such as the convergence or not of certain series) can be found in the theory of
integral equations (Hochstadt, 1973; Cochran, 1972; Courant and Hilbert, 1962), which is very
well established, so we do not discuss them here. In the following we assume that $\Omega = [0, 1]^d$ for
simplicity. The main ideas will carry over to the case $\Omega = R^d$, although with some modifications,
as we will see in section (A.2).

Let us assume that we find a sequence of positive numbers $\lambda_n$ and linearly independent functions
$\phi_n(\mathbf{x})$ such that they define a function $K(\mathbf{x}; \mathbf{y})$ in the following way (see footnote 1):

$$K(\mathbf{x}; \mathbf{y}) = \sum_{n=1}^{\infty} \lambda_n \phi_n(\mathbf{x}) \phi_n(\mathbf{y}) \qquad (28)$$

where the series is well defined (for example it converges uniformly). A simple calculation shows
that the function $K$ defined in eq. (28) is positive semi-definite. Let us now take as Hilbert
space the set of functions of the form

$$f(\mathbf{x}) = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x}) \qquad (29)$$

in which the scalar product is defined as:

$$\left\langle \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x}), \sum_{n=1}^{\infty} d_n \phi_n(\mathbf{x}) \right\rangle_{\mathcal{H}} \equiv \sum_{n=1}^{\infty} \frac{c_n d_n}{\lambda_n} \qquad (30)$$

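Assuming that all the evaluation functionals are bounded, it is now easy to check that such a
Hilbert space is a RKHS with reproducing kernel given by $K(\mathbf{x}; \mathbf{y})$. In fact we have

$$\langle f(\mathbf{x}), K(\mathbf{x}; \mathbf{y}) \rangle_{\mathcal{H}} = \sum_{n=1}^{\infty} \frac{c_n \lambda_n \phi_n(\mathbf{y})}{\lambda_n} = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{y}) = f(\mathbf{y}).$$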
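
The reproducing property can be verified numerically for a truncated expansion. In the sketch below the truncation order `N`, the basis $\phi_n(x) = \cos(nx)$, the sequence $\lambda_n = h^n$, the coefficient vector, and all function names are assumptions chosen for illustration. The key step is that $K(\cdot; y)$, seen as an element of the space, has expansion coefficients $d_n = \lambda_n \phi_n(y)$, so the scalar product of eq. (30) returns $f(y)$.

```python
import math

N = 8    # truncation order (assumption)
h = 0.5  # lambda_n = h^n is positive and decreasing for h in (0, 1)

lambdas = [h ** n for n in range(N)]
phis = [lambda x, n=n: math.cos(n * x) for n in range(N)]

def evaluate(coeffs, x):
    """f(x) = sum of c_n phi_n(x), a truncated version of eq. (29)."""
    return sum(c * phi(x) for c, phi in zip(coeffs, phis))

def inner(c, d):
    """Scalar product of eq. (30): sum of c_n d_n / lambda_n."""
    return sum(cn * dn / lam for cn, dn, lam in zip(c, d, lambdas))

def kernel_coeffs(y):
    """Coefficients of K(.; y) in the basis: d_n = lambda_n phi_n(y)."""
    return [lam * phi(y) for lam, phi in zip(lambdas, phis)]

# Hypothetical expansion coefficients and evaluation point.
c = [0.3, -1.0, 0.5, 0.0, 0.2, 0.0, 0.0, -0.1]
y = 1.234

lhs = inner(c, kernel_coeffs(y))  # <f, K(.; y)>_H
rhs = evaluate(c, y)              # f(y)
print(lhs, rhs)
```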

We conclude that it is possible to construct a RKHS whenever a function $K$ of the form (28) is
available. The norm in this RKHS has the form:

$$\|f\|_{\mathcal{H}}^2 = \sum_{n=1}^{\infty} \frac{c_n^2}{\lambda_n} \qquad (31)$$

It is well known that expressions of the form (28) actually abound. In fact, it follows from
Mercer’s theorem (Hochstadt, 1972) that any function $K(\mathbf{x}; \mathbf{y})$ which is the kernel of a positive
operator (see footnote 2) in $L_2(\Omega)$ has an expansion of the form (28), in which the $\phi_n$ and the $\lambda_n$ are, respectively,
the orthogonal eigenfunctions and the positive eigenvalues of the operator corresponding to $K$. In
(Stewart, 1976) it is reported that the positivity of the operator associated to $K$ is equivalent to
the statement that the kernel $K$ is positive definite, that is the matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive
definite for all choices of distinct points $\mathbf{x}_i$. Notice that a kernel $K$ could have an expansion of
the form (28) in which the $\phi_n$ are not necessarily its eigenfunctions.

Footnote 1: When working with complex functions $\phi_n(\mathbf{x})$ formula (28) should be replaced with
$K(\mathbf{x}; \mathbf{y}) = \sum_{n=1}^{\infty} \lambda_n \phi_n(\mathbf{x}) \phi_n^*(\mathbf{y})$.

Footnote 2: We remind the reader that positive operators in $L_2$ are self-adjoint operators such that
$\langle Kf, f \rangle \ge 0$ for all $f \in L_2$.

The case in which $\Omega = R^d$ is similar, with the difference that the eigenvalues may assume any
positive value, so that there will be a non-countable set of orthogonal eigenfunctions. In the
following section we provide a number of examples of these different situations, that also show
why the norm $\|f\|_{\mathcal{H}}^2$ can be seen as a smoothness functional.

A.1 Examples: RKHS over $[0, 2\pi]$

Here we present a simple way to construct meaningful RKHS of functions of one variable over
$[0, 2\pi]$. In the following all the normalization factors will be set to 1 for simplicity.

Let us consider any function $K(x)$ which is continuous, symmetric, periodic, and whose Fourier
coefficients $\lambda_n$ are positive. Such a function can be expanded in a uniformly convergent Fourier
series:

$$ K(x) = \sum_{n=0}^{\infty} \lambda_n \cos(nx) \tag{32} $$

An example of such a function is

$$ K(x) = 1 + \sum_{n=1}^{\infty} h^n \cos(nx) = \frac{1}{2\pi} \frac{1 - h^2}{1 - 2h\cos(x) + h^2} $$

where $h \in (0, 1)$.

It is easy to check that, if (32) holds, then we have

$$ K(x - y) = \lambda_0 + \sum_{n=1}^{\infty} \lambda_n \sin(nx) \sin(ny) + \sum_{n=1}^{\infty} \lambda_n \cos(nx) \cos(ny) \tag{33} $$

which is of the form (28) in which the set of orthogonal functions $\phi_n$ has the form:

$$ \{\phi_i(x)\}_{i=0}^{\infty} = (1, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots, \sin(nx), \cos(nx), \ldots) . $$
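The two identities used in this example can be checked numerically. The closed form quoted in the text sets all normalization factors to 1; with the factors written out, the classical Poisson-kernel identity reads $1 + 2\sum_{n \geq 1} h^n \cos(nx) = (1 - h^2)/(1 - 2h\cos x + h^2)$, and that is the version tested in this sketch. The function names, the truncation lengths and the sample points are illustrative assumptions.

```python
import math

def poisson_closed_form(x, h):
    """(1 - h^2) / (1 - 2 h cos(x) + h^2)."""
    return (1.0 - h * h) / (1.0 - 2.0 * h * math.cos(x) + h * h)

def poisson_series(x, h, n_terms=200):
    """Truncated series 1 + 2 * sum_{n>=1} h^n cos(n x)."""
    return 1.0 + 2.0 * sum(h ** n * math.cos(n * x) for n in range(1, n_terms + 1))

def kernel_1d(t, lambdas):
    """Truncation of eq. (32): K(t) = sum_{n>=0} lambda_n cos(n t)."""
    return sum(lam * math.cos(n * t) for n, lam in enumerate(lambdas))

def kernel_expanded(x, y, lambdas):
    """Right-hand side of eq. (33), built from the functions 1, sin(nx), cos(nx)."""
    total = lambdas[0]
    for n in range(1, len(lambdas)):
        total += lambdas[n] * (math.sin(n * x) * math.sin(n * y)
                               + math.cos(n * x) * math.cos(n * y))
    return total

lambdas = [0.5 ** n for n in range(12)]          # hypothetical positive coefficients
lhs = kernel_1d(0.9 - 2.3, lambdas)              # K(x - y)
rhs = kernel_expanded(0.9, 2.3, lambdas)         # expansion in products phi_n(x) phi_n(y)
```

The agreement of `lhs` and `rhs` is just the addition formula $\cos(n(x - y)) = \cos(nx)\cos(ny) + \sin(nx)\sin(ny)$ applied term by term.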

Therefore, given any function $K$ which is continuous, periodic and symmetric, with positive Fourier
coefficients $\lambda_n$, we can define a RKHS $\mathcal{H}$ over $[0, 2\pi]$ by defining a scalar product of the form:

$$ \langle f, g \rangle_{\mathcal{H}} = \sum_{n=0}^{\infty} \frac{f_n^c g_n^c + f_n^s g_n^s}{\lambda_n} $$

where we use the following symbols for the Fourier coefficients of a function $f$:

$$ f_n^c = \langle f, \cos(nx) \rangle , \qquad f_n^s = \langle f, \sin(nx) \rangle $$

The functions in $\mathcal{H}$ are therefore functions in $L_2([0, 2\pi])$ whose Fourier coefficients satisfy the
following constraint:

$$ \|f\|_{\mathcal{H}}^2 = \sum_{n=0}^{\infty} \frac{(f_n^c)^2 + (f_n^s)^2}{\lambda_n} < +\infty \tag{34} $$

Since the sequence $\lambda_n$ is decreasing, the constraint that the norm (34) has to be finite can be seen
as a constraint on the rate of decrease to zero of the Fourier coefficients of the function $f$, which
is known to be related to the smoothness properties of $f$. Therefore, choosing different kernels
$K$ is equivalent to choosing RKHS of functions with different smoothness properties, and the norm
(34) can be used as the smoothness functional $\Phi[f]$ in the regularization approach sketched in
section 2. The relationship between the kernel $K$ and the smoothness properties of the functions
in the corresponding RKHS will become clearer in the next section, where we discuss the
extension of this approach to the infinite domain $\Omega = R^d$.
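As a small illustration of how the finiteness of the norm (34) constrains the decay of the Fourier coefficients, the sketch below compares two hypothetical coefficient sequences against the weights $\lambda_n = h^n$. The sequences, the value of $h$ and the truncation length are assumptions chosen for the example; only cosine coefficients are used, for brevity.

```python
def rkhs_norm_sq(fourier_coeffs, lambdas):
    """Partial sum of eq. (34), using cosine coefficients only."""
    return sum(f * f / lam for f, lam in zip(fourier_coeffs, lambdas))

h = 0.5
n_terms = 60
lambdas = [h ** n for n in range(n_terms)]

# f_n = 0.5**n: f_n^2 / lambda_n = 0.5**n, a convergent geometric series.
fast_decay = [0.5 ** n for n in range(n_terms)]
# f_n = 0.8**n: f_n^2 / lambda_n = 1.28**n, which grows without bound.
slow_decay = [0.8 ** n for n in range(n_terms)]

norm_fast = rkhs_norm_sq(fast_decay, lambdas)   # partial sums approach 1 / (1 - 0.5)
norm_slow = rkhs_norm_sq(slow_decay, lambdas)   # keeps growing as n_terms increases
```

Only the first sequence corresponds to a function of this RKHS: its coefficients decay faster than $\sqrt{\lambda_n}$, while those of the second do not.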


A.2 Examples: RKHS over $R^d$

When the domain $\Omega$ over which we wish to define a RKHS becomes the whole space $R^d$ most of
the results of the previous section still apply, with the difference that the spectrum of $K$ becomes
(usually) the whole positive axis, and it is not countable anymore.

For translation invariant kernels, that is positive definite functions of the form $K(\mathbf{x} - \mathbf{y})$, the
following decomposition holds:

$$ K(\mathbf{x} - \mathbf{y}) = \int_{R^d} d\mathbf{s} \, \tilde{K}(\mathbf{s}) \, e^{i \mathbf{s} \cdot \mathbf{x}} e^{-i \mathbf{s} \cdot \mathbf{y}} \tag{35} $$

Equation (35) is the analog of (28) over an infinite domain, and one can go from the case of
bounded $\Omega$ to the case of $\Omega = R^d$ by the following substitutions:

$$ \begin{array}{rcl}
n & \Rightarrow & \mathbf{s} \\
\lambda_n & \Rightarrow & \tilde{K}(\mathbf{s}) \\
\phi_n(\mathbf{x}) & \Rightarrow & e^{i \mathbf{s} \cdot \mathbf{x}} \\
\sum_{n=1}^{\infty} & \Rightarrow & \int_{R^d} d\mathbf{s}
\end{array} $$

We conclude then that any positive definite function of the form $K(\mathbf{x} - \mathbf{y})$ defines a RKHS over
$R^d$ by defining a scalar product of the form


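$$ \langle f, g \rangle_{\mathcal{H}} = \int_{R^d} d\mathbf{s} \, \frac{\tilde{f}(\mathbf{s}) \, \tilde{g}^*(\mathbf{s})}{\tilde{K}(\mathbf{s})} \tag{36} $$

The reproducing property of $K$ is easily verified:

$$ \langle f(\mathbf{x}), K(\mathbf{x} - \mathbf{y}) \rangle_{\mathcal{H}} = \int_{R^d} d\mathbf{s} \, \frac{\tilde{f}(\mathbf{s}) \, \bigl( \tilde{K}(\mathbf{s}) \, e^{-i \mathbf{s} \cdot \mathbf{y}} \bigr)^*}{\tilde{K}(\mathbf{s})} = f(\mathbf{y}) $$

and the RKHS becomes simply the subspace of $L_2(R^d)$ of the functions such that

$$ \|f\|_{\mathcal{H}}^2 = \int_{R^d} d\mathbf{s} \, \frac{|\tilde{f}(\mathbf{s})|^2}{\tilde{K}(\mathbf{s})} < +\infty \tag{37} $$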

Functionals of the form (37) are known to be smoothness functionals. In fact, the rate of decrease
to zero of the Fourier transform of the kernel will control the smoothness property of the functions
in the RKHS. Consider for example, in one dimension, the kernel $K(x) = e^{-|x|}$, whose Fourier
transform is $\tilde{K}(s) = (1 + s^2)^{-1}$. The RKHS associated to this kernel contains functions such
that

$$ \|f\|_{\mathcal{H}}^2 = \int ds \, \frac{|\tilde{f}(s)|^2}{(1 + s^2)^{-1}} = \|f\|_{L_2}^2 + \|f'\|_{L_2}^2 < \infty $$

This is the well known Sobolev space $W_2^1$, where we denote by $W_2^m$ the set of functions whose
derivatives up to order $m$ are in $L_2$ (Yosida, 1974). Notice that the norm induced by the scalar
product (36) is the smoothness functional considered by Girosi, Jones and Poggio (1995) in their
approach to regularization theory for function approximation. This is not surprising, since RKHS
have been known to play a central role in spline theory (Wahba, 1990). Notice also that in spline
theory one actually deals with semi-RKHS, in which the norm $\| \cdot \|_{\mathcal{H}}$ has been substituted with a
semi-norm. Semi-RKHS share most of the properties of RKHS, but their theory becomes a little
more complicated because of the null space of the semi-norm, which has to be taken into account.
Details about semi-RKHS can be found in (Wahba, 1990).
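The step from the weighted Fourier integral to the Sobolev norm can be spelled out as follows. This is a sketch under the convention used throughout this appendix that normalization factors of the Fourier transform are set to 1: differentiation multiplies $\tilde{f}(s)$ by $is$, and Parseval's identity equates the $L_2$ norms of a function and of its transform.

```latex
\begin{aligned}
\|f\|_{\mathcal{H}}^2
  &= \int ds \, (1 + s^2) \, |\tilde{f}(s)|^2 \\
  &= \int ds \, |\tilde{f}(s)|^2 + \int ds \, |i s \, \tilde{f}(s)|^2 \\
  &= \|f\|_{L_2}^2 + \|f'\|_{L_2}^2 .
\end{aligned}
```

A kernel whose transform decays faster, for instance like $(1 + s^2)^{-m}$, would by the same argument penalize derivatives up to order $m$.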


A.3 Finite Dimensional RKHS 

When the set of basis functions $\phi_i$ has finite cardinality $N$, the construction of a RKHS sketched
in the previous section (eq. 1 and 30) is always well defined, as long as the basis functions are
linearly independent. Notice that the functions $\phi_i$ do not have to be orthogonal, and that they
will not be the eigenfunctions of $K$. It is interesting to notice that in this case we can define a
different set of basis functions $\tilde{\phi}_i$, which we call the dual basis functions, with some interesting
properties. The dual basis functions are defined as:

$$ \tilde{\phi}_i(\mathbf{x}) = \sum_{j=1}^{N} M^{-1}_{ij} \phi_j(\mathbf{x}) \tag{38} $$

where $M^{-1}$ is the inverse of the matrix $M$:

$$ M_{ij} = \langle \phi_i, \phi_j \rangle \tag{39} $$

and the scalar product is taken in $L_2$. It is easy to verify that, for any function of the space, the
following identity holds:

$$ f(\mathbf{x}) = \sum_{i=1}^{N} \langle f, \phi_i \rangle \tilde{\phi}_i(\mathbf{x}) = \sum_{i=1}^{N} \langle f, \tilde{\phi}_i \rangle \phi_i(\mathbf{x}) \tag{40} $$

where the second part of the identity comes from the fact that the dual basis of the dual basis is
the original basis. From here we conclude that, for any choice of positive $\lambda_i$, the set of functions
spanned by the functions $\phi_i$ forms a RKHS whose norm is

$$ \|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{N} \frac{\langle f, \tilde{\phi}_i \rangle^2}{\lambda_i} \tag{41} $$


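Notice that while the elements of the (dual) basis are not orthogonal to each other, orthogonality
relationships hold between elements of the basis and the elements of the dual basis:

$$ \langle \tilde{\phi}_i, \phi_j \rangle = \delta_{ij} $$

As a consequence, it is also possible to show that, defining the dual kernel as:

$$ \tilde{K}(\mathbf{x}; \mathbf{y}) = \sum_{i=1}^{N} \lambda_i \tilde{\phi}_i(\mathbf{x}) \tilde{\phi}_i(\mathbf{y}) \tag{42} $$

the following relationships hold:

$$ \begin{aligned}
\int d\mathbf{y} \, K(\mathbf{x}; \mathbf{y}) \, \tilde{\phi}_i(\mathbf{y}) &= \lambda_i \phi_i(\mathbf{x}) \\
\int d\mathbf{y} \, \tilde{K}(\mathbf{x}; \mathbf{y}) \, \phi_i(\mathbf{y}) &= \lambda_i \tilde{\phi}_i(\mathbf{x})
\end{aligned} \tag{43} $$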
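The dual basis can be computed explicitly in a small case. The sketch below is an illustration under assumptions made here: the basis is taken to be the monomials $1, x, x^2$ on $[0, 1]$, for which the Gram matrix of eq. (39) has entries $1/(i + j + 1)$, and exact rational arithmetic is used so that the biorthogonality relation can be checked without rounding.

```python
from fractions import Fraction

N = 3  # basis phi_j(x) = x**j, j = 0, 1, 2, on [0, 1] (illustrative choice)

# Gram matrix of eq. (39): M_ij = integral_0^1 x**i * x**j dx = 1 / (i + j + 1)
M = [[Fraction(1, i + j + 1) for j in range(N)] for i in range(N)]

def invert(matrix):
    """Gauss-Jordan inversion with exact rational arithmetic."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

M_inv = invert(M)  # row i holds the monomial coefficients of the dual function, eq. (38)

def l2_inner(p, q):
    """L2([0, 1]) scalar product of two polynomials given as coefficient lists."""
    return sum(a * b * Fraction(1, i + j + 1)
               for i, a in enumerate(p) for j, b in enumerate(q))

def monomial(j):
    return [Fraction(int(k == j)) for k in range(N)]

# <dual_i, phi_j> should be the Kronecker delta.
biorthogonality = [[l2_inner(M_inv[i], monomial(j)) for j in range(N)] for i in range(N)]

# Second half of eq. (40): the coefficients of f are recovered as <f, dual_i>.
f = [Fraction(2), Fraction(-1), Fraction(3)]
recovered = [l2_inner(f, M_inv[i]) for i in range(N)]
```

With these coefficients in hand, the norm (41) is simply `sum(c * c / lam for c, lam in zip(recovered, lambdas))` for any chosen positive `lambdas`.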

A.3.1 A RKHS of polynomials 

In this section we present a particular RKHS, which allows us to partially answer a long-standing
question: is it possible to derive, in the framework of regularization theory, an approximation
scheme of the form:

$$ f(\mathbf{x}) = \sum_{i=1}^{n} c_i \, \sigma(\mathbf{x} \cdot \mathbf{w}_i + \theta) \tag{44} $$

where $\sigma$ is some continuous, one dimensional function? Connections between approximation
schemes of the form (44) and regularization theory have been presented before (Girosi, Jones,
and Poggio, 1995), but always involving approximation or extension of the original regularization
theory framework. A positive answer to the previous question is given by noticing that it is
possible to define a RKHS whose kernel is

$$ K(\mathbf{x}; \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^d $$

where $d$ is any positive integer. In fact, the kernel $K$ above has an expansion of the form:

$$ K(\mathbf{x}; \mathbf{y}) = \sum_{n=1}^{N} \lambda_n \phi_n(\mathbf{x}) \phi_n(\mathbf{y}) \tag{45} $$

where the $\phi_n$ are the monomials of degree up to $d$, which constitute a basis for the set of
polynomials of degree $d$, and the $\lambda_n$ are some positive numbers. The kernel above therefore can
be used to give the structure of RKHS to the set of polynomials of degree $d$ (in arbitrary number
of variables), and the norm defined by eq. (41) can be used as a smoothness functional in the
regularization theory approach (see appendix B) to derive an approximating scheme of the form:

$$ f(\mathbf{x}) = \sum_{i=1}^{l} c_i \, (1 + \mathbf{x} \cdot \mathbf{x}_i)^d $$

which is a special case of eq. (44). The fact that the kernel $K(\mathbf{x}; \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^d$ has an
expansion of the form (45) had been reported by Vapnik (1995). Here we give some examples,
from which the reader can infer the general case.

Let us start from the one-dimensional case, where one chooses:

$$ \phi_n(x) = x^n , \qquad \lambda_n = \binom{d}{n} , \qquad n = 0, \ldots, d $$

It is now easy to see that

$$ K(x; y) = \sum_{n=0}^{d} \lambda_n \phi_n(x) \phi_n(y) = \sum_{n=0}^{d} \binom{d}{n} (xy)^n = (1 + xy)^d $$

A similar result, although with a more complex structure of the coefficients $\lambda_n$, is true in the
multivariate case. For example in two variables we can define:

$$ \{\phi_i(\mathbf{x})\}_{i=1}^{6} = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2) $$

$$ \{\lambda_i\}_{i=1}^{6} = (1, 2, 2, 2, 1, 1) $$

In this case it is easy to verify that:

$$ K(\mathbf{x}; \mathbf{y}) = \sum_{n=1}^{6} \lambda_n \phi_n(\mathbf{x}) \phi_n(\mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^2 \tag{46} $$

The same kernel can be obtained in 3 variables by choosing:

$$ \{\phi_i(\mathbf{x})\}_{i=1}^{10} = (1, x_1, x_2, x_3, x_1^2, x_2^2, x_3^2, x_1 x_2, x_1 x_3, x_2 x_3) $$

$$ \{\lambda_i\}_{i=1}^{10} = (1, 2, 2, 2, 1, 1, 1, 2, 2, 2) $$

In 2 variables, we can also make the following choice:

$$ \{\phi_i(\mathbf{x})\}_{i=1}^{10} = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2, x_1^2 x_2, x_1 x_2^2, x_1^3, x_2^3) $$

$$ \{\lambda_i\}_{i=1}^{10} = (1, 3, 3, 3, 3, 6, 3, 3, 1, 1) $$

and it is easy to see that:

$$ K(\mathbf{x}; \mathbf{y}) = \sum_{n=1}^{10} \lambda_n \phi_n(\mathbf{x}) \phi_n(\mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^3 $$
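These expansions are easy to check by direct evaluation. The sketch below is illustrative: it takes the weights $\lambda_n$ to be the integer coefficients 1, 2, 3 and 6 for which the identities hold exactly (an assumption of this sketch), uses integer test points so that the comparison is exact, and all function and variable names are hypothetical.

```python
from math import comb

def poly_kernel(x, y, d):
    """K(x; y) = (1 + x . y)^d for tuples x and y."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** d

def expansion(x, y, lambdas, features):
    """sum_n lambda_n phi_n(x) phi_n(y), as in eq. (45)."""
    return sum(lam * phi(x) * phi(y) for lam, phi in zip(lambdas, features))

def one_dim_expansion(x, y, d):
    """One-dimensional case: phi_n(x) = x^n with lambda_n = binomial(d, n)."""
    return sum(comb(d, n) * (x ** n) * (y ** n) for n in range(d + 1))

# Degree 2, two variables: (1, x1, x2, x1 x2, x1^2, x2^2)
features_d2 = [
    lambda v: 1,
    lambda v: v[0],
    lambda v: v[1],
    lambda v: v[0] * v[1],
    lambda v: v[0] ** 2,
    lambda v: v[1] ** 2,
]
lambdas_d2 = [1, 2, 2, 2, 1, 1]

# Degree 3, two variables:
# (1, x1, x2, x1^2, x2^2, x1 x2, x1^2 x2, x1 x2^2, x1^3, x2^3)
features_d3 = [
    lambda v: 1,
    lambda v: v[0],
    lambda v: v[1],
    lambda v: v[0] ** 2,
    lambda v: v[1] ** 2,
    lambda v: v[0] * v[1],
    lambda v: v[0] ** 2 * v[1],
    lambda v: v[0] * v[1] ** 2,
    lambda v: v[0] ** 3,
    lambda v: v[1] ** 3,
]
lambdas_d3 = [1, 3, 3, 3, 3, 6, 3, 3, 1, 1]

x, y = (2, -3), (5, 7)   # arbitrary integer test points
value_d2 = expansion(x, y, lambdas_d2, features_d2)
value_d3 = expansion(x, y, lambdas_d3, features_d3)
```

In each case the weight attached to a monomial is the multinomial coefficient with which it appears when $(1 + \mathbf{x} \cdot \mathbf{y})^d$ is multiplied out.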

While it is difficult to write a closed formula for the coefficients $\lambda_i$ in the general case, it is
obvious that such coefficients can always be found. This, therefore, leaves unclear what form of
smoothness is imposed by using the norm in this particular RKHS, and shows that this example
is quite an academic one. A much more interesting case is the one in which the function $\sigma$ in
eq. (44) is a sigmoid or some other activation function. If it were possible to find an expansion
of the form:

$$ \sigma(\mathbf{x} \cdot \mathbf{y} + \theta) = \sum_{n=1}^{\infty} \lambda_n \phi_n(\mathbf{x}) \phi_n(\mathbf{y}) $$

for some fixed value of $\theta$, this would mean that a scheme of the form

$$ f(\mathbf{x}) = \sum_{i=1}^{n} c_i \, \sigma(\mathbf{x} \cdot \mathbf{x}_i + \theta) $$

could be derived in a regularization theory framework. Vapnik (1995) reports that if $\sigma(x) =
\tanh(x)$ the corresponding kernel is positive definite for some values of $\theta$, but the observation is
of experimental nature, so that we do not know what the $\lambda_i$ or the $\phi_n$ are, and therefore we do
not know what kind of functions the corresponding RKHS contains.

B Derivation of the SVM Algorithm

B.1 Generalities on Regularization Theory

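Let us look more closely at the solution of the variational problem (2):

$$ \min_{f \in \mathcal{H}} H[f] = C \sum_{i=1}^{l} V(y_i - f(\mathbf{x}_i)) + \frac{1}{2} \Phi[f] $$

We assume that $\mathcal{H}$ is a RKHS with kernel $K$ and that the smoothness functional $\Phi[f]$ is:

$$ \Phi[f] = \|f\|_{\mathcal{H}}^2 $$

This is equivalent to assuming that the functions in $\mathcal{H}$ have a unique expansion of the form:

$$ f(\mathbf{x}) = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x}) $$

and that their norm is:

$$ \|f\|_{\mathcal{H}}^2 = \sum_{n=1}^{\infty} \frac{c_n^2}{\lambda_n} $$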


In this derivation we do not have the coefficient $b$ which appears in (1), since we argued before
that if one of the $\phi_n$ is constant, which is usually the case, this term is not necessary.

We can think of the functional $H[f]$ as a function of the coefficients $c_n$. In order to minimize
$H[f]$ we take its derivative with respect to $c_n$ and set it equal to zero, obtaining the following:

$$ -C \sum_{i=1}^{l} V'(y_i - f(\mathbf{x}_i)) \, \phi_n(\mathbf{x}_i) + \frac{c_n}{\lambda_n} = 0 \tag{47} $$

Let us now define the following set of unknowns:

$$ a_i = C \, V'(y_i - f(\mathbf{x}_i)) $$

Using eq. (47) we can express the coefficients $c_n$ as a function of the $a_i$:

$$ c_n = \lambda_n \sum_{i=1}^{l} a_i \phi_n(\mathbf{x}_i) $$

The solution of the variational problem has therefore the form:

$$ f(\mathbf{x}) = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x}) = \sum_{n=1}^{\infty} \sum_{i=1}^{l} a_i \lambda_n \phi_n(\mathbf{x}_i) \phi_n(\mathbf{x}) = \sum_{i=1}^{l} a_i K(\mathbf{x}; \mathbf{x}_i) \tag{48} $$

where we have used the expansion (28). This shows that, independently of the form of $V$, the
solution of the regularization functional (2) is always a linear superposition of kernel functions,
one for each data point. The cost function $V$ affects the computation of the coefficients $a_i$. In
fact, plugging eq. (48) back in the definition of the $a_i$ we obtain the following set of equations
for the coefficients $a_i$:

$$ a_i = C \, V'\left( y_i - \sum_{j=1}^{l} K_{ij} a_j \right) , \qquad i = 1, \ldots, l $$

where we have defined $K_{ij} = K(\mathbf{x}_i; \mathbf{x}_j)$. In the case in which $V(x) = x^2$ we obtain the standard
regularization theory solution (see Girosi, Jones and Poggio, 1995 for an alternative derivation):

$$ (K + \gamma I) \, \mathbf{a} = \mathbf{y} $$

where we have defined $\gamma = \frac{1}{2C}$.
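For the quadratic cost the whole procedure reduces to one linear solve. The sketch below is a minimal illustration, not the paper's implementation: the Gaussian kernel, the toy data, the value of $C$ and the small elimination routine are assumptions, and $\gamma = 1/(2C)$ is the value that follows from $V'(x) = 2x$ in the stationarity condition $a_i = C V'(y_i - f(\mathbf{x}_i))$.

```python
import math

def gaussian_kernel(x, y, width=1.0):
    """Illustrative positive definite kernel (an assumption of this sketch)."""
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

xs = [0.0, 0.5, 1.0, 1.5, 2.0]          # toy inputs
ys = [math.sin(x) for x in xs]          # toy targets
C = 10.0
gamma = 1.0 / (2.0 * C)

n = len(xs)
K = [[gaussian_kernel(xi, xj) for xj in xs] for xi in xs]
A = [[K[i][j] + (gamma if i == j else 0.0) for j in range(n)] for i in range(n)]
a = solve(A, ys)                        # coefficients of (K + gamma I) a = y

def f(x):
    """Solution of the form (48): one kernel function per data point."""
    return sum(ai * gaussian_kernel(x, xi) for ai, xi in zip(a, xs))

# Stationarity check: a_i = C V'(y_i - f(x_i)) with V'(x) = 2 x.
residuals = [2.0 * C * (yi - f(xi)) - ai for xi, yi, ai in zip(xs, ys, a)]
```

The entries of `residuals` vanish up to rounding, confirming that the solution of the linear system satisfies the fixed-point equations for the $a_i$.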


B.2 The SVM algorithm in the Regularization Theory Framework 

Following Vapnik (1995) we now consider the case of the $\epsilon$-insensitive cost function $V(x) = |x|_\epsilon$.
In this case the approach sketched above is problematic because $V$ is not differentiable at $x = \pm\epsilon$
(although it still makes sense everywhere else). In order to make our notation consistent with
Vapnik's, we have to modify slightly the model proposed in the previous section. Vapnik
explicitly takes into account an offset in the model, so that equation (1) is replaced by

$$ f(\mathbf{x}) = \sum_{n=1}^{\infty} c_n \phi_n(\mathbf{x}) + b \tag{49} $$

The smoothness functional remains unchanged (so that the smoothness does not depend on $b$):

$$ \Phi[f] = \sum_{n=1}^{\infty} \frac{c_n^2}{\lambda_n} $$

Also, we scale the functional in (2) of a factor A- = C, obtaining the following variational 
problem: 


$$ \min_{f \in \mathcal{H}} H[f] = C \sum_{i=1}^{l} |y_i - f(\mathbf{x}_i)|_\epsilon + \frac{1}{2} \Phi[f] $$

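Since it is difficult to deal with the function $V(x) = |x|_\epsilon$, the problem above is replaced by the
following equivalent³ problem, in which an additional set of variables is introduced:

$$ \min_{f \in \mathcal{H}} H[f] = C \sum_{i=1}^{l} (\xi_i + \xi_i^*) + \frac{1}{2} \Phi[f] \tag{50} $$

subject to

$$ \begin{array}{rcll}
f(\mathbf{x}_i) - y_i & \leq & \epsilon + \xi_i & \quad i = 1, \ldots, l \\
y_i - f(\mathbf{x}_i) & \leq & \epsilon + \xi_i^* & \quad i = 1, \ldots, l \\
\xi_i & \geq & 0 & \quad i = 1, \ldots, l \\
\xi_i^* & \geq & 0 & \quad i = 1, \ldots, l
\end{array} \tag{51} $$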
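The relation between the slack variables and the $\epsilon$-insensitive cost can be made concrete for a single data point. In the sketch below the function names are hypothetical; for given $y_i$, $f(\mathbf{x}_i)$ and $\epsilon$ it computes the smallest slacks compatible with the constraints (51), which is what the minimization in (50) selects.

```python
def eps_insensitive(err, eps):
    """Vapnik's |err|_eps: zero inside the tube of half-width eps, linear outside."""
    return max(0.0, abs(err) - eps)

def minimal_slacks(y, fx, eps):
    """Smallest (xi, xi_star) satisfying the constraints (51) for one data point."""
    xi = max(0.0, fx - y - eps)        # from f(x_i) - y_i <= eps + xi_i
    xi_star = max(0.0, y - fx - eps)   # from y_i - f(x_i) <= eps + xi_i^*
    return xi, xi_star

eps = 0.5
cases = [(1.0, 2.0), (1.0, 1.25), (3.0, 1.0)]   # (y_i, f(x_i)) pairs
slacks = [minimal_slacks(y, fx, eps) for y, fx in cases]
losses = [eps_insensitive(y - fx, eps) for y, fx in cases]
```

For every pair the sum of the two slacks equals the $\epsilon$-insensitive cost, and at most one of the two slacks is non-zero.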


The equivalence of the two variational problems is established by noticing that in the problem above
a (linear) penalty is paid only when the absolute value of the interpolation error exceeds $\epsilon$, which
corresponds to Vapnik's $\epsilon$-insensitive cost function. Notice that when one of the two top constraints is
satisfied with some non-zero $\xi_i$ (or $\xi_i^*$), the other is automatically satisfied with a zero value for
$\xi_i^*$ (or $\xi_i$).

³ By equivalent we mean that the function that minimizes the two functionals is the same.

In order to solve the constrained minimization problem above we use the technique
of Lagrange multipliers. The Lagrangian corresponding to the problem above is:

$$ \begin{aligned}
\mathcal{L}(f, \xi, \xi^*; \alpha, \alpha^*, r, r^*) = {} & C \sum_{i=1}^{l} (\xi_i + \xi_i^*) + \frac{1}{2} \Phi[f] + \sum_{i=1}^{l} \alpha_i^* (y_i - f(\mathbf{x}_i) - \epsilon - \xi_i^*) \\
& + \sum_{i=1}^{l} \alpha_i (f(\mathbf{x}_i) - y_i - \epsilon - \xi_i) - \sum_{i=1}^{l} (r_i \xi_i + r_i^* \xi_i^*)
\end{aligned} \tag{52} $$

where $\alpha, \alpha^*, r, r^*$ are positive Lagrange multipliers. The solution of the constrained variational
problem above is now obtained by minimizing the Lagrangian (52) with respect to $f$ (that is
with respect to the $c_n$ and to $b$), $\xi$ and $\xi^*$, and maximizing (in the positive quadrant) with respect
to $\alpha, \alpha^*, r, r^*$. Since the minimization step is now unconstrained, we set to zero the derivatives
with respect to $c_n$, $b$, $\xi$ and $\xi^*$, obtaining:


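$$ \begin{array}{lcl}
\dfrac{\partial \mathcal{L}}{\partial c_n} = 0 & \rightarrow & c_n = \lambda_n \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) \phi_n(\mathbf{x}_i) \\[2ex]
\dfrac{\partial \mathcal{L}}{\partial b} = 0 & \rightarrow & \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) = 0 \\[2ex]
\dfrac{\partial \mathcal{L}}{\partial \xi_n} = 0 & \rightarrow & r_n = C - \alpha_n \\[2ex]
\dfrac{\partial \mathcal{L}}{\partial \xi_n^*} = 0 & \rightarrow & r_n^* = C - \alpha_n^*
\end{array} $$

Substituting the expression for the coefficients $c_n$ in the model (49) we then conclude that the
solution of the problem (50) is a function of the form

$$ f(\mathbf{x}) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) K(\mathbf{x}; \mathbf{x}_i) + b \tag{53} $$

Substituting eq. (53) in the Lagrangian, we obtain an expression that should now be maximized
(in the positive quadrant) with respect to $\alpha, \alpha^*, r, r^*$, with the additional constraints listed above.
Noticing that the relationship between $r_n$ ($r_n^*$) and $\alpha_n$ ($\alpha_n^*$) implies that $\alpha \leq C$ and $\alpha^* \leq C$,
and minimizing $-\mathcal{L}$ rather than maximizing $\mathcal{L}$, we now obtain the following QP problem: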


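Problem B.1

$$ \min_{\alpha, \alpha^*} \mathcal{L}(\alpha, \alpha^*) = \epsilon \sum_{i=1}^{l} (\alpha_i^* + \alpha_i) - \sum_{i=1}^{l} y_i (\alpha_i^* - \alpha_i) + \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(\mathbf{x}_i; \mathbf{x}_j) , $$

subject to the constraints

$$ 0 \leq \alpha^*, \alpha \leq C $$

$$ \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) = 0 $$

This is the QP problem that has to be solved in order to compute the SVM solution. It is useful
to write and discuss the Kuhn-Tucker conditions:

$$ \begin{array}{rcll}
\alpha_i (f(\mathbf{x}_i) - y_i - \epsilon - \xi_i) & = & 0 & \quad i = 1, \ldots, l \\
\alpha_i^* (y_i - f(\mathbf{x}_i) - \epsilon - \xi_i^*) & = & 0 & \quad i = 1, \ldots, l \\
(C - \alpha_i) \xi_i & = & 0 & \quad i = 1, \ldots, l \\
(C - \alpha_i^*) \xi_i^* & = & 0 & \quad i = 1, \ldots, l
\end{array} $$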


The input data points $\mathbf{x}_i$ for which $\alpha_i$ or $\alpha_i^*$ are different from zero are called support vectors.
A few observations are in order:

• The Lagrange multipliers $\alpha_i$ and $\alpha_i^*$ cannot be simultaneously different from zero, so that
the constraint $\alpha_i \alpha_i^* = 0$ holds.

• The support vectors are those data points $\mathbf{x}_i$ at which the interpolation error is
greater than or equal to $\epsilon$. Points at which the interpolation error is smaller than $\epsilon$ are never
support vectors, and do not enter in the determination of the solution. Once they have
been found, they could be removed from the data set, and if the SVM were run again on
the new data set the same solution would be found.

• Any of the support vectors for which $0 < \alpha_i < C$ (and therefore $\xi_i = 0$) can be used to
compute the parameter $b$. In fact, in this case it follows from the Kuhn-Tucker conditions
that:

$$ f(\mathbf{x}_i) = \sum_{j=1}^{l} (\alpha_j^* - \alpha_j) K(\mathbf{x}_i; \mathbf{x}_j) + b = y_i + \epsilon $$

(a similar argument holds for the $\alpha_i^*$).

• If $\epsilon = 0$ then all the points become support vectors.

• Because of the constraint $\alpha_i \alpha_i^* = 0$, defining

$$ \mathbf{a} = \alpha^* - \alpha $$

and using eq. (24) the QP problem B.1 can be written as follows:

Problem B.2

$$ \min_{\mathbf{a}} E[\mathbf{a}] = \epsilon \|\mathbf{a}\|_{L_1} - \mathbf{a} \cdot \mathbf{y} + \frac{1}{2} \mathbf{a} \cdot K \mathbf{a} $$

subject to the constraints

$$ -C \leq a_i \leq C $$

$$ \mathbf{a} \cdot \mathbf{1} = 0 $$
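The passage from Problem B.1 to Problem B.2 rests on the fact that $\alpha_i + \alpha_i^* = |\alpha_i^* - \alpha_i|$ whenever $\alpha_i \alpha_i^* = 0$. The sketch below evaluates both objectives on a small hand-made example; the kernel matrix, the data and the multiplier values are arbitrary assumptions chosen only to satisfy the constraints.

```python
def objective_b1(alpha, alpha_star, y, K, eps):
    """Objective of Problem B.1 as a function of the two sets of multipliers."""
    n = len(y)
    beta = [alpha_star[i] - alpha[i] for i in range(n)]
    linear = eps * sum(alpha_star[i] + alpha[i] for i in range(n))
    data = sum(y[i] * beta[i] for i in range(n))
    quad = 0.5 * sum(beta[i] * beta[j] * K[i][j] for i in range(n) for j in range(n))
    return linear - data + quad

def objective_b2(a, y, K, eps):
    """Objective E[a] of Problem B.2, with the L1 norm replacing the linear term."""
    n = len(y)
    l1 = eps * sum(abs(v) for v in a)
    data = sum(a[i] * y[i] for i in range(n))
    quad = 0.5 * sum(a[i] * a[j] * K[i][j] for i in range(n) for j in range(n))
    return l1 - data + quad

K = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.5]]   # toy kernel matrix
y = [1.0, -2.0, 0.5]
eps = 0.25

# Multipliers with alpha_i * alpha_i^* = 0 for every i.
alpha = [0.0, 0.75, 0.0]
alpha_star = [0.5, 0.0, 0.25]
a = [s - t for s, t in zip(alpha_star, alpha)]   # a = alpha^* - alpha, sums to zero

same = objective_b1(alpha, alpha_star, y, K, eps) - objective_b2(a, y, K, eps)

# Same a, but alpha_1 and alpha_1^* both non-zero: B.1 is strictly larger.
alpha_bad = [0.25, 0.75, 0.0]
alpha_star_bad = [0.75, 0.0, 0.25]
gap = objective_b1(alpha_bad, alpha_star_bad, y, K, eps) - objective_b2(a, y, K, eps)
```

The second case shows why the constraint matters: a pair with both multipliers non-zero gives the same $\mathbf{a}$ but a larger value of the B.1 objective, so it is never selected by the minimization.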

Important note: Notice that if one of the basis functions $\phi_i$ is constant, then the parameter $b$ in
(49) could be omitted. The RKHS described in appendix A all have this property.

C Noisy case: an equivalence?

It is natural to ask whether the result of this paper can be extended to the case of noisy data. I
will sketch here an argument to show that there is still a relationship between SVM and sparse
approximation, when data are noisy, although the relationship is much less clear. In the presence
of additive noise we have

$$ f(\mathbf{x}_i) = y_i + \delta_i , $$

where $y_i$ are the measured values of $f$, and $\delta_i$ are random variables with unknown probability
distribution. Substituting $y_i$ with $y_i + \delta_i$ in eq. (23), disregarding the constant term in $\|f\|_{\mathcal{H}}^2$,
and defining

$$ E[\mathbf{a}] = -\sum_{i=1}^{l} a_i y_i + \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j K(\mathbf{x}_i; \mathbf{x}_j) + \epsilon \sum_{i=1}^{l} |a_i| $$

we conclude that we need to solve the following QP problem:

Problem C.1

$$ \min_{\mathbf{a}} \left[ E[\mathbf{a}] - \mathbf{a} \cdot \delta \right] $$

subject to the constraint:

$$ \mathbf{a} \cdot \mathbf{1} = 0 $$

where the vector $\delta$ is unknown.

In order to understand how to deal with the fact that we do not know $\delta$, let us consider a different
QP problem:

Problem C.2

$$ \min_{\mathbf{a}} E[\mathbf{a}] $$

subject to the constraints:

$$ \mathbf{a} \cdot \mathbf{1} = 0 $$

$$ \mathbf{a} \geq \eta $$

$$ \mathbf{a} \leq \eta^* $$

where the box parameters $\eta$ and $\eta^*$ are unknown.

We solve problem C.2 using the Lagrange multipliers technique for the inequality constraints,
obtaining the following dual version of problem C.2:

Problem C.3

$$ \max_{\beta, \beta^*} \min_{\mathbf{a}} \left[ E[\mathbf{a}] - \mathbf{a} \cdot (\beta - \beta^*) + \beta \cdot \eta - \beta^* \cdot \eta^* \right] $$

subject to the constraints:

$$ \mathbf{a} \cdot \mathbf{1} = 0 $$

$$ \beta, \beta^* \geq 0 $$

where $\beta$ and $\beta^*$ are vectors of Lagrange multipliers.
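The objective of Problem C.3 is the Lagrangian of Problem C.2 with the two box constraints attached through non-negative multipliers. As a sketch (the symbol $\Lambda$ is introduced here only for illustration, and the equality constraint $\mathbf{a} \cdot \mathbf{1} = 0$ is kept explicit rather than dualized):

```latex
\begin{aligned}
\Lambda(\mathbf{a}; \beta, \beta^*)
  &= E[\mathbf{a}] - \beta \cdot (\mathbf{a} - \eta) + \beta^* \cdot (\mathbf{a} - \eta^*) \\
  &= E[\mathbf{a}] - \mathbf{a} \cdot (\beta - \beta^*) + \beta \cdot \eta - \beta^* \cdot \eta^* ,
  \qquad \beta, \beta^* \geq 0 .
\end{aligned}
```

The only term coupling $\mathbf{a}$ to the multipliers is the linear one, $-\mathbf{a} \cdot (\beta - \beta^*)$, which has the same structure as the term $-\mathbf{a} \cdot \delta$ of Problem C.1.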
Notice now that the choice of the box parameters $\eta$ and $\eta^*$ uniquely determines $\beta$ and $\beta^*$,
and that setting $\delta = \beta - \beta^*$, problems C.1 and C.3 are identical for what concerns the $\mathbf{a}$
vector: they both require solving a QP problem in which a linear term contains unknown
coefficients. Therefore, solving problem C.1 with unknown $\delta$ seems to be formally equivalent to
solving problem C.3 with unknown box parameters. This suggests the following argument: 1)
solving C.1 with unknown $\delta$ is formally equivalent to solving problem C.3 with unknown box
parameters; 2) in the absence of any information on the noise, and therefore on the box parameters,
we could set the box parameters to $\eta^* = -\eta = C\mathbf{1}$ for some unknown $C$; 3) for $\eta^* = -\eta = C\mathbf{1}$
problem C.3 becomes the usual QP problem of SVM (problem B.1); 4) therefore, in total absence
of information on the noise, problem C.1 leads to the same QP problem of SVM, making the
equivalence between sparse approximation and SVM complete. However, this argument is not
very rigorous, because it does not make clear how the assumptions on $\eta$ and $\eta^*$ are reflected on
the noise vector $\delta$. Nevertheless, the formal similarity of the problems C.3 and C.1 seems to point in
the right direction, and an analysis of the relationship between $\eta$, $\eta^*$ and $\delta$ could lead to useful
insights on the assumptions which are made on the noise in the SVM technique.